70 research outputs found
Convergence Analysis of the Approximate Newton Method for Markov Decision Processes
Recently two approximate Newton methods were proposed for the optimisation of
Markov Decision Processes. While these methods were shown to have desirable
properties, such as a guarantee that the preconditioner is
negative-semidefinite when the policy is log-concave with respect to the
policy parameters, and were demonstrated to have strong empirical performance
in challenging domains, such as the game of Tetris, no convergence analysis was
provided. The purpose of this paper is to provide such an analysis. We start by
providing a detailed analysis of the Hessian of a Markov Decision Process,
which is formed of a negative-semidefinite component, a positive-semidefinite
component and a remainder term. The first part of our analysis details how the
negative-semidefinite and positive-semidefinite components relate to each
other, and how these two terms contribute to the Hessian. The next part of our
analysis shows that under certain conditions, relating to the richness of the
policy class, the remainder term in the Hessian vanishes in the vicinity of a
local optimum. Finally, we bound the behaviour of this remainder term in terms
of the mixing time of the Markov chain induced by the policy parameters, where
this part of the analysis is applicable over the entire parameter space. Given
this analysis of the Hessian we then provide our local convergence analysis of
the approximate Newton framework. Comment: This work has been removed because a more recent piece of work (A Gauss-Newton method for Markov Decision Processes, T. Furmston & G. Lever) has subsumed it.
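For orientation, the decomposition referred to above can be sketched in generic policy-gradient notation (an illustrative grouping of terms under the stated assumptions, not necessarily the paper's exact formulation): with objective J(\theta) and policy \pi_\theta,

\nabla^2_\theta J(\theta) = \mathbb{E}\!\left[ Q^{\pi_\theta}(s,a)\, \nabla^2_\theta \log \pi_\theta(a|s) \right] + \mathbb{E}\!\left[ Q^{\pi_\theta}(s,a)\, \nabla_\theta \log \pi_\theta(a|s)\, \nabla_\theta \log \pi_\theta(a|s)^\top \right] + \text{remainder},

with expectations taken over the discounted state-action distribution induced by \pi_\theta. Under non-negative rewards, the first term is negative-semidefinite whenever \pi_\theta is log-concave in \theta (this is the component that plays the role of the preconditioner mentioned above), the second term is positive-semidefinite, and the remainder collects the cross terms involving \nabla_\theta Q^{\pi_\theta}.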
Biases for Emergent Communication in Multi-agent Reinforcement Learning
We study the problem of emergent communication, in which language arises
because speakers and listeners must communicate information in order to solve
tasks. In temporally extended reinforcement learning domains, it has proved
hard to learn such communication without centralized training of agents, due in
part to a difficult joint exploration problem. We introduce inductive biases
for positive signalling and positive listening, which ease this problem. In a
simple one-step environment, we demonstrate how these biases ease the learning
problem. We also apply our methods to a more extended environment, showing that
agents with these inductive biases achieve better performance, and analyse the
resulting communication protocols. Comment: Accepted at NeurIPS 2019.
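The abstract does not spell out the form of these biases; as a purely illustrative sketch (assuming a batch of per-observation message distributions, and not reproducing the paper's actual loss), a positive-signalling bias can be realised as a mutual-information-style bonus between the speaker's observation and its message:

    import numpy as np

    def positive_signalling_bonus(message_probs):
        # message_probs: (batch, n_messages) array; row i is the speaker's message
        # distribution given observation i. Hypothetical helper, for illustration only.
        eps = 1e-8
        marginal = message_probs.mean(axis=0)
        h_marginal = -np.sum(marginal * np.log(marginal + eps))        # H(M)
        h_conditional = -np.sum(message_probs * np.log(message_probs + eps),
                                axis=1).mean()                          # H(M | O), batch average
        return h_marginal - h_conditional                               # empirical estimate of I(M; O)

Maximising such a bonus pushes the speaker to use different messages for different observations; a positive-listening term would analogously encourage the listener's action distribution to depend on the received message.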
Modelling transition dynamics in MDPs with RKHS embeddings
We propose a new, nonparametric approach to learning and representing
transition dynamics in Markov decision processes (MDPs), which can be combined
easily with dynamic programming methods for policy optimisation and value
estimation. This approach makes use of a recently developed representation of
conditional distributions as \emph{embeddings} in a reproducing kernel Hilbert
space (RKHS). Such representations bypass the need for estimating transition
probabilities or densities, and apply to any domain on which kernels can be
defined. This avoids the need to calculate intractable integrals, since
expectations are represented as RKHS inner products whose computation has
linear complexity in the number of points used to represent the embedding. We
provide guarantees for the proposed applications in MDPs: in the context of a
value iteration algorithm, we prove convergence to either the optimal policy,
or to the closest projection of the optimal policy in our model class (an
RKHS), under reasonable assumptions. In experiments, we investigate a learning
task in a typical classical control setting (the under-actuated pendulum), and
on a navigation problem where only images from a sensor are observed. For
policy optimisation we compare with least-squares policy iteration where a
Gaussian process is used for value function estimation. For value estimation we
also compare to the NPDP method. Our approach achieves better performance in
all experiments. Comment: ICML 2012.
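To make the "expectations as RKHS inner products with linear cost in the number of points" statement concrete, here is a minimal sketch of an empirical conditional mean embedding with a Gaussian kernel (the kernel choice, the regulariser lam, and the helper names are illustrative assumptions, not the paper's implementation):

    import numpy as np

    def gaussian_gram(X, Y, sigma=1.0):
        # Gram matrix K[i, j] = exp(-||X[i] - Y[j]||^2 / (2 * sigma^2))
        sq_dists = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
        return np.exp(-sq_dists / (2.0 * sigma ** 2))

    def embedding_weights(S, s_query, lam=1e-3, sigma=1.0):
        # beta(s) = (K + n*lam*I)^{-1} k_S(s): weights of the empirical conditional
        # mean embedding of p(s' | s) over the sampled successor states.
        n = S.shape[0]
        K = gaussian_gram(S, S, sigma)
        k_s = gaussian_gram(S, s_query.reshape(1, -1), sigma)[:, 0]
        return np.linalg.solve(K + n * lam * np.eye(n), k_s)

    # Given transition samples (S[i], S_next[i]) and value estimates v[i] = V(S_next[i]),
    # the conditional expectation E[V(s') | s] is approximated by the inner product
    # beta @ v, which costs O(n) once beta has been computed:
    # beta = embedding_weights(S, s); expected_value = beta @ v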
Deterministic Policy Gradient Algorithms
In this paper we consider deterministic policy gradient algorithms for reinforcement learning with continuous actions. The deterministic policy gradient has a particularly appealing form: it is the expected gradient of the action-value function. This simple form means that the deterministic policy gradient can be estimated much more efficiently than the usual stochastic policy gradient. To ensure adequate exploration, we introduce an off-policy actor-critic algorithm that learns a deterministic target policy from an exploratory behaviour policy. We demonstrate that deterministic policy gradient algorithms can significantly outperform their stochastic counterparts in high-dimensional action spaces.
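For reference, the "expected gradient of the action-value function" form mentioned above is (up to notation) the deterministic policy gradient theorem:

\nabla_\theta J(\mu_\theta) = \mathbb{E}_{s \sim \rho^\mu}\!\left[ \nabla_\theta \mu_\theta(s)\, \nabla_a Q^\mu(s,a) \big|_{a = \mu_\theta(s)} \right],

where \mu_\theta is the deterministic policy and \rho^\mu the discounted state distribution it induces; no integral over actions is required, which is the source of the efficiency gain described above.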
- …